{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Handy Utils to do Vector Search on Collections"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step-1: Configuration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from my_config import MY_CONFIG"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step-2: Connect to Vector Database\n",
    "\n",
    "Milvus can be embedded and easy to use.\n",
    "\n",
    "<span style=\"color:blue;\">Note: If you encounter an error about unable to load database, try this: </span>\n",
    "\n",
    "- <span style=\"color:blue;\">In **vscode** : **restart the kernel** of previous notebook. This will release the db.lock </span>\n",
    "- <span style=\"color:blue;\">In **Jupyter**: Do `File --> Close and Shutdown Notebook` of previous notebook. This will release the db.lock</span>\n",
    "- <span style=\"color:blue;\">Re-run this cell again</span>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/shalisha.witherspoonibm.com/Documents/DPK_RAG_UPDATE/data-prep-kit-outer/examples/rag-pdf-1/venv/lib/python3.12/site-packages/milvus_lite/__init__.py:15: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.\n",
      "  from pkg_resources import DistributionNotFound, get_distribution\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "✅ Connected to Milvus instance: ./rag_1_dpk.db\n"
     ]
    }
   ],
   "source": [
    "from pymilvus import MilvusClient\n",
    "\n",
    "milvus_client = MilvusClient(MY_CONFIG.DB_URI)\n",
    "\n",
    "print (\"✅ Connected to Milvus instance:\", MY_CONFIG.DB_URI)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step-3: Setup Embeddings\n",
    "\n",
    "Two choices here. \n",
    "\n",
    "1. use sentence transformers directly\n",
    "2. use Milvus model wrapper"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Option 1 - use sentence transformers directly\n",
    "\n",
    "# If connection to https://huggingface.co/ failed, uncomment the following path\n",
    "import os\n",
    "os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'\n",
    "\n",
    "from sentence_transformers import SentenceTransformer\n",
    "\n",
    "embedding_model = SentenceTransformer(MY_CONFIG.EMBEDDING_MODEL)\n",
    "\n",
    "def get_embeddings (str):\n",
    "    embeddings = embedding_model.encode(str, normalize_embeddings=True)\n",
    "    return embeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Option 2 - Milvus model\n",
    "from pymilvus import model\n",
    "\n",
    "# If connection to https://huggingface.co/ failed, uncomment the following path\n",
    "import os\n",
    "os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'\n",
    "\n",
    "\n",
    "# embedding_fn = model.DefaultEmbeddingFunction()\n",
    "\n",
    "## initialize the SentenceTransformerEmbeddingFunction\n",
    "embedding_fn = model.dense.SentenceTransformerEmbeddingFunction(\n",
    "    model_name = MY_CONFIG.EMBEDDING_MODEL,\n",
    "    device='cpu' # this will work on all devices (KIS)\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sentence transformer : embeddings len = 384\n",
      "sentence transformer : embeddings[:5] =  [-0.05225766  0.04597738  0.04810531 -0.00142911 -0.0265453 ]\n",
      "milvus model wrapper : embeddings len = 384\n",
      "milvus model wrapper  : embeddings[:5] =  [-0.05225766  0.04597746  0.04810531 -0.00142899 -0.02654536]\n"
     ]
    }
   ],
   "source": [
    "# Test Embeddings\n",
    "text = 'Paris 2024 Olympics'\n",
    "embeddings = get_embeddings(text)\n",
    "print ('sentence transformer : embeddings len =', len(embeddings))\n",
    "print ('sentence transformer : embeddings[:5] = ', embeddings[:5])\n",
    "\n",
    "embeddings = embedding_fn([text])\n",
    "print ('milvus model wrapper : embeddings len =', len(embeddings[0]))\n",
    "print ('milvus model wrapper  : embeddings[:5] = ', embeddings[0][:5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step-4: Do A  Vector Search\n",
    "\n",
    "We will do this to verify data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "\n",
    "\n",
    "## helper function to perform vector search\n",
    "def  do_vector_search (query):\n",
    "    query_vectors = [get_embeddings(query)]  # Option 1 - using sentence transformers\n",
    "    # query_vectors = embedding_fn([query])  # using Milvus model \n",
    "\n",
    "    results = milvus_client.search(\n",
    "        collection_name=MY_CONFIG.COLLECTION_NAME,  # target collection\n",
    "        data=query_vectors,  # query vectors\n",
    "        limit=5,  # number of returned entities\n",
    "        output_fields=[\"filename\", \"page_number\", \"text\"],  # specifies fields to be returned\n",
    "    )\n",
    "    return results\n",
    "## ----\n",
    "\n",
    "def  print_search_results (results):\n",
    "    # pprint (results)\n",
    "    print ('num results : ', len(results[0]))\n",
    "\n",
    "    for i, r in enumerate (results[0]):\n",
    "        #pprint(r, indent=4)\n",
    "        print (f'------ result {i+1} --------')\n",
    "        print ('search score:', r['distance'])\n",
    "        print ('filename:', r['entity']['filename'])\n",
    "        if 'page_number' in r['entity']:\n",
    "            print ('page number:', r['entity']['page_number'])\n",
    "        print ('text:\\n', r['entity']['text'])\n",
    "        print()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "num results :  5\n",
      "------ result 1 --------\n",
      "search score: 0.8241064548492432\n",
      "filename: granite.pdf\n",
      "text:\n",
      " ## 4.3 Optimization\n",
      "\n",
      "We use AdamW optimizer (Kingma &amp; Ba, 2017) with β 1 = 0.9, β 2 = 0.95 and weight decay of 0.1 for training all our Granite code models. For the phase-1 pretraining, the learning rate follows a cosine schedule starting from 3 × 10 -4 which decays to 3 × 10 -5 with an initial linear warmup step of 2k iterations. For phase-2 pretraining, we start from 3 × 10 -4 (1.5 × 10 -4 for 20B and 34B models) and adopt an exponential decay schedule to anneal it to 10% of the initial learning rate. We use a batch size of 4M-5M tokens depending on the model size during both phases of pretraining.\n",
      "\n",
      "To accelerate training, we use FlashAttention 2 (Dao et al., 2022; Dao, 2023), the persistent layernorm kernel, Fused RMSNorm kernel (depending on the model) and the Fused Adam kernel available in NVIDIA's Apex library. We use a custom fork of NVIDIA's MegatronLM(Shoeybi et al., 2019; Narayanan et al., 2021) for distributed training of all our models. We train with a mix of 3D parallelism: tensor parallel, pipeline parallel and data parallel. We also use sequence parallelism (Korthikanti et al., 2023) for reducing the activation memoryconsumption of large context length during training. We use Megatron's distributed optimizer with mixed precision training (Micikevicius et al., 2018) in BF16 (Kalamkar et al., 2019) with gradient all-reduce and gradient accumulation in FP32 for training stability.\n",
      "\n",
      "------ result 2 --------\n",
      "search score: 0.7872612476348877\n",
      "filename: granite.pdf\n",
      "text:\n",
      " ## 4.1 Two Phase Training\n",
      "\n",
      "Granite Code models are trained on 3.5T to 4.5T tokens of code data and natural language datasets related to code. Data is tokenized via byte pair encoding (BPE, (Sennrich et al., 2015)), employing the same tokenizer as StarCoder (Li et al., 2023a). Following (Shen et al., 2024; Hu et al., 2024), we utilize high-quality data with two phases of training as follows.\n",
      "\n",
      "- Phase 1 (code only training) : During phase 1, both 3B and 8B models are trained for 4 trillion tokens of code data comprising 116 languages. The 20B parameter model is trained on 3 trillion tokens of code. The 34B model is trained on 1.4T tokens after the depth upscaling which is done on the 1.6T checkpoint of 20B model.\n",
      "- Phase 2 (code + language training) : In phase 2, we include additional high-quality publicly available data from various domains, including technical, mathematics, and web documents, to further improve the model's performance in reasoning and problem solving skills, which are essential for code generation. We train all our models for 500B tokens (80% code and 20% language data) in phase 2 training.\n",
      "\n",
      "------ result 3 --------\n",
      "search score: 0.7863028049468994\n",
      "filename: granite.pdf\n",
      "text:\n",
      " ## 3 Model Architecture\n",
      "\n",
      "We train a series of code models of varying sizes based on the transformer decoder architecture (Vaswani et al., 2017). The model hyperparameters for these models are given in Table 1. For all model architectures, we use pre-normalization (Xiong et al., 2020): normalization applied to the input of attention and MLP blocks.\n",
      "\n",
      "Table 1: Model configurations for Granite Code models.\n",
      "\n",
      "| Model                                                                                                                            | 3B                                                       | 8B                                                      | 20B                                                    | 34B                                                    |\n",
      "|----------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------|---------------------------------------------------------|--------------------------------------------------------|--------------------------------------------------------|\n",
      "| Batch size Context length Hidden size FFN hidden size Attention heads Key-Value heads Layers Normalization Activation Vocab size | 2048 2048 2560 10240 32 32 (MHA) 32 RMSNorm swiglu 49152 | 1024 4096 4096 14336 32 8 (GQA) 36 RMSNorm swiglu 49152 | 576 8192 6144 24576 48 1 (MQA) 52 LayerNorm gelu 49152 | 532 8192 6144 24576 48 1 (MQA) 88 LayerNorm gelu 49152 |\n",
      "\n",
      "3B : The smallest model in the Granite-code model family is trained with RoPE embedding (Su et al., 2023) and Multi-Head Attention (Vaswani et al., 2017). This model use the swish activation function (Ramachandran et al., 2017) with GLU (Shazeer, 2020) for the MLP, also commonly referred to as swiglu. For normalization, we use RMSNorm (Zhang &amp; Sennrich, 2019) since it's computationally more efficient than LayerNorm (Ba et al., 2016). The 3B model is trained with a context length of 2048 tokens.\n",
      "\n",
      "8B : The 8B model has a similar architecture as the 3B model with the exception of using Grouped-Query Attention (GQA) (Ainslie et al., 2023). Using GQA offers a better tradeoff between model performance and inference efficiency at this scale. We train the 8B model with a context length of 4096 tokens.\n",
      "\n",
      "20B : The 20B code model is trained with learned absolute position embeddings. We use Multi-Query Attention (Shazeer, 2019) during training for efficient downstream inference. For the MLP block, we use the GELU activation function (Hendrycks &amp; Gimpel, 2023). For normalizing the activations, we use LayerNorm (Ba et al., 2016). This model is trained with a context length of 8192 tokens.\n",
      "\n",
      "34B : To train the 34B model, we follow the approach by Kim et al. for depth upscaling of the 20B model. Specifically, we first duplicate the 20B code model with 52 layers and then\n",
      "\n",
      "5 https://www.clamav.net/\n",
      "\n",
      "Figure 2: An overview of depth upscaling (Kim et al., 2024) for efficient training of Granite34B-Code. We utilize the 20B model after 1.6T tokens to start training of 34B model with the same code pretraining data without any changes to the training and inference framework.\n",
      "\n",
      "<!-- image -->\n",
      "\n",
      "remove final 8 layers from the original model and initial 8 layers from its duplicate to form two models. Finally, we concatenate both models to form Granite-34B-Code model with 88 layers (see Figure 2 for an illustration). After the depth upscaling, we observe that the drop in performance compared to 20B model is pretty small contrary to what is observed by Kim et al.. This performance is recovered pretty quickly after we continue pretraining of the upscaled 34B model. Similar, to 20B, we use a 8192 token context during pretraining.\n",
      "\n",
      "------ result 4 --------\n",
      "search score: 0.7587494254112244\n",
      "filename: granite.pdf\n",
      "text:\n",
      " ## 6.1.6 FIM: Infilling Evaluations\n",
      "\n",
      "Granite Code models are trained for code completion purposes using FIM objective, as described in Sec. 4.2. We use SantaCoder-FIM benchmark (Allal et al., 2023), for infilling evaluations which tests the ability of models to fill in a single line of code in Python, JavaScript, and Java solutions to HumanEval. We use greedy decoding and report the mean exact match for all the models. Table 9 shows that Granite Code models significantly outperforms StarCoder and StarCoder2 across all model sizes, demonstrating it to be\n",
      "\n",
      "Figure 3: Performance of Granite-8B-Code-Instruct, Mistral-7B-Instruct-v0.2, Gemma-7B-IT, and Llama-3-8B-Instruct on HumanEvalPack. Best viewed in color.\n",
      "\n",
      "<!-- image -->\n",
      "\n",
      "excellent well-rounded models for code completion use cases. Moreover, we observe no performance improvement in scaling the model sizes from 8B to 34B, indicating that smaller models are often more suitable for FIM code completion tasks.\n",
      "\n",
      "------ result 5 --------\n",
      "search score: 0.7569425702095032\n",
      "filename: attention.pdf\n",
      "text:\n",
      " ## 5 Training\n",
      "\n",
      "This section describes the training regime for our models.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "query = \"What was the training data used to train Granite models?\"\n",
    "\n",
    "results = do_vector_search (query)\n",
    "print_search_results(results)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "num results :  5\n",
      "------ result 1 --------\n",
      "search score: 0.7932685613632202\n",
      "filename: attention.pdf\n",
      "text:\n",
      " ## Attention Visualizations Input-Input Layer5\n",
      "\n",
      "Figure 3: An example of the attention mechanism following long-distance dependencies in the encoder self-attention in layer 5 of 6. Many of the attention heads attend to a distant dependency of the verb 'making', completing the phrase 'making...more difficult'. Attentions here shown only for the word 'making'. Different colors represent different heads. Best viewed in color.\n",
      "\n",
      "<!-- image -->\n",
      "\n",
      "Input-Input Layer5\n",
      "\n",
      "Figure 4: Two attention heads, also in layer 5 of 6, apparently involved in anaphora resolution. Top: Full attentions for head 5. Bottom: Isolated attentions from just the word 'its' for attention heads 5 and 6. Note that the attentions are very sharp for this word.\n",
      "\n",
      "<!-- image -->\n",
      "\n",
      "Input-Input Layer5\n",
      "\n",
      "Figure 5: Many of the attention heads exhibit behaviour that seems related to the structure of the sentence. We give two such examples above, from two different heads from the encoder self-attention at layer 5 of 6. The heads clearly learned to perform different tasks.\n",
      "\n",
      "<!-- image -->\n",
      "\n",
      "------ result 2 --------\n",
      "search score: 0.7886360883712769\n",
      "filename: attention.pdf\n",
      "text:\n",
      " ## 3.2 Attention\n",
      "\n",
      "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum\n",
      "\n",
      "------ result 3 --------\n",
      "search score: 0.7576702237129211\n",
      "filename: attention.pdf\n",
      "text:\n",
      " ## Scaled Dot-Product Attention\n",
      "\n",
      "<!-- image -->\n",
      "\n",
      "Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel.\n",
      "\n",
      "<!-- image -->\n",
      "\n",
      "of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.\n",
      "\n",
      "------ result 4 --------\n",
      "search score: 0.7559462785720825\n",
      "filename: attention.pdf\n",
      "text:\n",
      " ## 3.2.3 Applications of Attention in our Model\n",
      "\n",
      "The Transformer uses multi-head attention in three different ways:\n",
      "\n",
      "- In \"encoder-decoder attention\" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [38, 2, 9].\n",
      "- The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.\n",
      "- Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to -∞ ) all values in the input of the softmax which correspond to illegal connections. See Figure 2.\n",
      "\n",
      "------ result 5 --------\n",
      "search score: 0.7531822919845581\n",
      "filename: attention.pdf\n",
      "text:\n",
      " ## 3.2.1 Scaled Dot-Product Attention\n",
      "\n",
      "We call our particular attention \"Scaled Dot-Product Attention\" (Figure 2). The input consists of queries and keys of dimension d k , and values of dimension d v . We compute the dot products of the query with all keys, divide each by √ d k , and apply a softmax function to obtain the weights on the values.\n",
      "\n",
      "In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q . The keys and values are also packed together into matrices K and V . We compute the matrix of outputs as:\n",
      "\n",
      "<!-- formula-not-decoded -->\n",
      "\n",
      "The two most commonly used attention functions are additive attention [2], and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of 1 √ d k . Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.\n",
      "\n",
      "While for small values of d k the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of d k [3]. We suspect that for large values of d k , the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients 4 . To counteract this effect, we scale the dot products by 1 √ d k .\n",
      "\n"
     ]
    }
   ],
   "source": [
    "query = \"What is the attention mechanism?\"\n",
    "\n",
    "results = do_vector_search (query)\n",
    "print_search_results(results)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# milvus_client.close()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
